The goal of this task is to analyse a policing dataset from Dallas, Texas, covering 2016. Analysing police data helps identify patterns and relationships between variables such as crime types, locations, and time of day, and shows how those variables affect one another. These patterns indicate where and when incidents are most likely to occur, which helps police departments allocate resources more effectively. Trend analysis can also help anticipate future patterns: an increase in a certain type of crime signals that a department should focus its efforts on the affected areas to prevent further incidents. Identifying outliers matters as well. Outliers may point to rare events that require special attention, and departments may need to be prepared for them. More generally, data is collected and analysed so that it can inform decisions: which areas need improvement, which resources need to be allocated, and which changes need to be made. This is particularly important for a police department, which can use the information gained to prevent crime and improve public safety.
An initial exploration of the data set is a crucial first step. It gives the analyst a better understanding of the data and its variables, which in turn helps in developing the hypotheses and questions that guide further analysis. Exploration also reveals errors, inconsistencies, and missing values. Dealing with these helps ensure that the data is accurate and reliable, since errors can affect the results of the analysis. Finally, exploration indicates which preprocessing or cleaning tasks are needed, such as converting variables to their appropriate types and imputing missing values where necessary.
The first task is to load the data set into the R environment. A second, unmodified copy of the data frame is kept so that the original data is preserved. This provides a level of safety and flexibility, and helps ensure that the raw data remains intact throughout the analysis.
data <- read.csv("37-00049_UOF-P_2016_prepped.csv")
df <- read.csv("37-00049_UOF-P_2016_prepped.csv") # Keep an unmodified copy of the raw data
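The data set contains 2384 observations of 47 variables. The structure of the data frame shows that every variable has been read as character, even those that appear to hold numbers or dates. The first data row holds a second set of column labels (for example "OCCURRED_D" and "UOFNum"), which is enough to force each column to be read as text.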
str(data)#gives some properties of the variables
## 'data.frame': 2384 obs. of 47 variables:
## $ INCIDENT_DATE : chr "OCCURRED_D" "9/3/16" "3/22/16" "5/22/16" ...
## $ INCIDENT_TIME : chr "OCCURRED_T" "4:14:00 AM" "11:00:00 PM" "1:29:00 PM" ...
## $ UOF_NUMBER : chr "UOFNum" "37702" "33413" "34567" ...
## $ OFFICER_ID : chr "CURRENT_BA" "10810" "7706" "11014" ...
## $ OFFICER_GENDER : chr "OffSex" "Male" "Male" "Male" ...
## $ OFFICER_RACE : chr "OffRace" "Black" "White" "Black" ...
## $ OFFICER_HIRE_DATE : chr "HIRE_DT" "5/7/14" "1/8/99" "5/20/15" ...
## $ OFFICER_YEARS_ON_FORCE : chr "INCIDENT_DATE_LESS_" "2" "17" "1" ...
## $ OFFICER_INJURY : chr "OFF_INJURE" "No" "Yes" "No" ...
## $ OFFICER_INJURY_TYPE : chr "OFF_INJURE_DESC" "No injuries noted or visible" "Sprain/Strain" "No injuries noted or visible" ...
## $ OFFICER_HOSPITALIZATION : chr "OFF_HOSPIT" "No" "Yes" "No" ...
## $ SUBJECT_ID : chr "CitNum" "46424" "44324" "45126" ...
## $ SUBJECT_RACE : chr "CitRace" "Black" "Hispanic" "Hispanic" ...
## $ SUBJECT_GENDER : chr "CitSex" "Female" "Male" "Male" ...
## $ SUBJECT_INJURY : chr "CIT_INJURE" "Yes" "No" "No" ...
## $ SUBJECT_INJURY_TYPE : chr "SUBJ_INJURE_DESC" "Non-Visible Injury/Pain" "No injuries noted or visible" "No injuries noted or visible" ...
## $ SUBJECT_WAS_ARRESTED : chr "CIT_ARREST" "Yes" "Yes" "Yes" ...
## $ SUBJECT_DESCRIPTION : chr "CIT_INFL_A" "Mentally unstable" "Mentally unstable" "Unknown" ...
## $ SUBJECT_OFFENSE : chr "CitChargeT" "APOWW" "APOWW" "APOWW" ...
## $ REPORTING_AREA : chr "RA" "2062" "1197" "4153" ...
## $ BEAT : chr "BEAT" "134" "237" "432" ...
## $ SECTOR : chr "SECTOR" "130" "230" "430" ...
## $ DIVISION : chr "DIVISION" "CENTRAL" "NORTHEAST" "SOUTHWEST" ...
## $ LOCATION_DISTRICT : chr "DIST_NAME" "D14" "D9" "D6" ...
## $ STREET_NUMBER : chr "STREET_N" "211" "7647" "716" ...
## $ STREET_NAME : chr "STREET" "Ervay" "Ferguson" "bimebella dr" ...
## $ STREET_DIRECTION : chr "street_g" "N" "NULL" "NULL" ...
## $ STREET_TYPE : chr "street_t" "St." "Rd." "Ln." ...
## $ LOCATION_FULL_STREET_ADDRESS_OR_INTERSECTION: chr "Street Address" "211 N ERVAY ST" "7647 FERGUSON RD" "716 BIMEBELLA LN" ...
## $ LOCATION_CITY : chr "City" "Dallas" "Dallas" "Dallas" ...
## $ LOCATION_STATE : chr "State" "TX" "TX" "TX" ...
## $ LOCATION_LATITUDE : chr "Latitude" "32.782205" "32.798978" "32.73971" ...
## $ LOCATION_LONGITUDE : chr "Longitude" "-96.797461" "-96.717493" "-96.92519" ...
## $ INCIDENT_REASON : chr "SERVICE_TY" "Arrest" "Arrest" "Arrest" ...
## $ REASON_FOR_FORCE : chr "UOF_REASON" "Arrest" "Arrest" "Arrest" ...
## $ TYPE_OF_FORCE_USED1 : chr "ForceType1" "Hand/Arm/Elbow Strike" "Joint Locks" "Take Down - Group" ...
## $ TYPE_OF_FORCE_USED2 : chr "ForceType2" "" "" "" ...
## $ TYPE_OF_FORCE_USED3 : chr "ForceType3" "" "" "" ...
## $ TYPE_OF_FORCE_USED4 : chr "ForceType4" "" "" "" ...
## $ TYPE_OF_FORCE_USED5 : chr "ForceType5" "" "" "" ...
## $ TYPE_OF_FORCE_USED6 : chr "ForceType6" "" "" "" ...
## $ TYPE_OF_FORCE_USED7 : chr "ForceType7" "" "" "" ...
## $ TYPE_OF_FORCE_USED8 : chr "ForceType8" "" "" "" ...
## $ TYPE_OF_FORCE_USED9 : chr "ForceType9" "" "" "" ...
## $ TYPE_OF_FORCE_USED10 : chr "ForceType10" "" "" "" ...
## $ NUMBER_EC_CYCLES : chr "Cycles_Num" "NULL" "NULL" "NULL" ...
## $ FORCE_EFFECTIVE : chr "ForceEffec" " Yes" " Yes" " Yes" ...
dim(data)#gives dimensions of the data frame
## [1] 2384 47
names(data)# gives names of variables
## [1] "INCIDENT_DATE"
## [2] "INCIDENT_TIME"
## [3] "UOF_NUMBER"
## [4] "OFFICER_ID"
## [5] "OFFICER_GENDER"
## [6] "OFFICER_RACE"
## [7] "OFFICER_HIRE_DATE"
## [8] "OFFICER_YEARS_ON_FORCE"
## [9] "OFFICER_INJURY"
## [10] "OFFICER_INJURY_TYPE"
## [11] "OFFICER_HOSPITALIZATION"
## [12] "SUBJECT_ID"
## [13] "SUBJECT_RACE"
## [14] "SUBJECT_GENDER"
## [15] "SUBJECT_INJURY"
## [16] "SUBJECT_INJURY_TYPE"
## [17] "SUBJECT_WAS_ARRESTED"
## [18] "SUBJECT_DESCRIPTION"
## [19] "SUBJECT_OFFENSE"
## [20] "REPORTING_AREA"
## [21] "BEAT"
## [22] "SECTOR"
## [23] "DIVISION"
## [24] "LOCATION_DISTRICT"
## [25] "STREET_NUMBER"
## [26] "STREET_NAME"
## [27] "STREET_DIRECTION"
## [28] "STREET_TYPE"
## [29] "LOCATION_FULL_STREET_ADDRESS_OR_INTERSECTION"
## [30] "LOCATION_CITY"
## [31] "LOCATION_STATE"
## [32] "LOCATION_LATITUDE"
## [33] "LOCATION_LONGITUDE"
## [34] "INCIDENT_REASON"
## [35] "REASON_FOR_FORCE"
## [36] "TYPE_OF_FORCE_USED1"
## [37] "TYPE_OF_FORCE_USED2"
## [38] "TYPE_OF_FORCE_USED3"
## [39] "TYPE_OF_FORCE_USED4"
## [40] "TYPE_OF_FORCE_USED5"
## [41] "TYPE_OF_FORCE_USED6"
## [42] "TYPE_OF_FORCE_USED7"
## [43] "TYPE_OF_FORCE_USED8"
## [44] "TYPE_OF_FORCE_USED9"
## [45] "TYPE_OF_FORCE_USED10"
## [46] "NUMBER_EC_CYCLES"
## [47] "FORCE_EFFECTIVE"
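Checking for duplicates helps ensure that the data being analysed is accurate and of high quality. The function duplicated() returns a logical vector in which TRUE marks a row that repeats an earlier one. The first 100 entries are printed below and none is TRUE; no duplicates were observed in the data set.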
# Get the logical vector indicating duplicates
duplicates_mask <- duplicated(data)
# Select only the first 100 rows of the logical vector
first_100_duplicates <- head(duplicates_mask, n = 100)
# Print the first 100 entries of the mask (TRUE would mark a duplicated row)
print(first_100_duplicates)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [97] FALSE FALSE FALSE FALSE
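Printing part of the mask only inspects the rows shown. A whole-frame check can be summarised in one number instead. The following is a minimal sketch on a small made-up data frame (`toy` is hypothetical and not part of the Dallas data):

```r
# Hypothetical toy data frame; row 3 repeats row 1
toy <- data.frame(id = c(1, 2, 1), force = c("A", "B", "A"))

n_duplicates <- sum(duplicated(toy))  # number of rows that repeat an earlier row
first_dup    <- anyDuplicated(toy)    # index of the first duplicated row, 0 if none

n_duplicates
first_dup
```

Applied to the real data frame, `sum(duplicated(data))` returning 0 would confirm the absence of duplicates across all rows.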
The initial exploration showed that the first row of the data frame contains a second set of variable labels. This is redundant, given that the columns already have names, and it means the labels are counted as an observation. Removing this row reduces the number of observations to 2383.
data <- data[2:nrow(data), ]
dim(data)
## [1] 2383 47
The next step is to identify the type of each variable. This is important because it guides the choice of appropriate data cleaning and analysis techniques, and it helps reveal potential errors that could prevent a consistent analysis.
All variables are classified as character, even though some represent dates and numbers.
sapply(data, is.character)
## INCIDENT_DATE
## TRUE
## INCIDENT_TIME
## TRUE
## UOF_NUMBER
## TRUE
## OFFICER_ID
## TRUE
## OFFICER_GENDER
## TRUE
## OFFICER_RACE
## TRUE
## OFFICER_HIRE_DATE
## TRUE
## OFFICER_YEARS_ON_FORCE
## TRUE
## OFFICER_INJURY
## TRUE
## OFFICER_INJURY_TYPE
## TRUE
## OFFICER_HOSPITALIZATION
## TRUE
## SUBJECT_ID
## TRUE
## SUBJECT_RACE
## TRUE
## SUBJECT_GENDER
## TRUE
## SUBJECT_INJURY
## TRUE
## SUBJECT_INJURY_TYPE
## TRUE
## SUBJECT_WAS_ARRESTED
## TRUE
## SUBJECT_DESCRIPTION
## TRUE
## SUBJECT_OFFENSE
## TRUE
## REPORTING_AREA
## TRUE
## BEAT
## TRUE
## SECTOR
## TRUE
## DIVISION
## TRUE
## LOCATION_DISTRICT
## TRUE
## STREET_NUMBER
## TRUE
## STREET_NAME
## TRUE
## STREET_DIRECTION
## TRUE
## STREET_TYPE
## TRUE
## LOCATION_FULL_STREET_ADDRESS_OR_INTERSECTION
## TRUE
## LOCATION_CITY
## TRUE
## LOCATION_STATE
## TRUE
## LOCATION_LATITUDE
## TRUE
## LOCATION_LONGITUDE
## TRUE
## INCIDENT_REASON
## TRUE
## REASON_FOR_FORCE
## TRUE
## TYPE_OF_FORCE_USED1
## TRUE
## TYPE_OF_FORCE_USED2
## TRUE
## TYPE_OF_FORCE_USED3
## TRUE
## TYPE_OF_FORCE_USED4
## TRUE
## TYPE_OF_FORCE_USED5
## TRUE
## TYPE_OF_FORCE_USED6
## TRUE
## TYPE_OF_FORCE_USED7
## TRUE
## TYPE_OF_FORCE_USED8
## TRUE
## TYPE_OF_FORCE_USED9
## TRUE
## TYPE_OF_FORCE_USED10
## TRUE
## NUMBER_EC_CYCLES
## TRUE
## FORCE_EFFECTIVE
## TRUE
sapply(data, is.numeric) # none of the variables is stored as numeric
## INCIDENT_DATE
## FALSE
## INCIDENT_TIME
## FALSE
## UOF_NUMBER
## FALSE
## OFFICER_ID
## FALSE
## OFFICER_GENDER
## FALSE
## OFFICER_RACE
## FALSE
## OFFICER_HIRE_DATE
## FALSE
## OFFICER_YEARS_ON_FORCE
## FALSE
## OFFICER_INJURY
## FALSE
## OFFICER_INJURY_TYPE
## FALSE
## OFFICER_HOSPITALIZATION
## FALSE
## SUBJECT_ID
## FALSE
## SUBJECT_RACE
## FALSE
## SUBJECT_GENDER
## FALSE
## SUBJECT_INJURY
## FALSE
## SUBJECT_INJURY_TYPE
## FALSE
## SUBJECT_WAS_ARRESTED
## FALSE
## SUBJECT_DESCRIPTION
## FALSE
## SUBJECT_OFFENSE
## FALSE
## REPORTING_AREA
## FALSE
## BEAT
## FALSE
## SECTOR
## FALSE
## DIVISION
## FALSE
## LOCATION_DISTRICT
## FALSE
## STREET_NUMBER
## FALSE
## STREET_NAME
## FALSE
## STREET_DIRECTION
## FALSE
## STREET_TYPE
## FALSE
## LOCATION_FULL_STREET_ADDRESS_OR_INTERSECTION
## FALSE
## LOCATION_CITY
## FALSE
## LOCATION_STATE
## FALSE
## LOCATION_LATITUDE
## FALSE
## LOCATION_LONGITUDE
## FALSE
## INCIDENT_REASON
## FALSE
## REASON_FOR_FORCE
## FALSE
## TYPE_OF_FORCE_USED1
## FALSE
## TYPE_OF_FORCE_USED2
## FALSE
## TYPE_OF_FORCE_USED3
## FALSE
## TYPE_OF_FORCE_USED4
## FALSE
## TYPE_OF_FORCE_USED5
## FALSE
## TYPE_OF_FORCE_USED6
## FALSE
## TYPE_OF_FORCE_USED7
## FALSE
## TYPE_OF_FORCE_USED8
## FALSE
## TYPE_OF_FORCE_USED9
## FALSE
## TYPE_OF_FORCE_USED10
## FALSE
## NUMBER_EC_CYCLES
## FALSE
## FORCE_EFFECTIVE
## FALSE
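The two listings above can be condensed into a single count of classes. A minimal sketch on a small made-up data frame (`toy` and its columns are hypothetical):

```r
# Hypothetical toy data frame with two character columns and one numeric column
toy <- data.frame(a = c("1", "2"), b = c("x", "y"), n = c(1.5, 2.5),
                  stringsAsFactors = FALSE)

# Take the first class of each column, then tabulate how often each class occurs
col_classes  <- vapply(toy, function(x) class(x)[1], character(1))
class_counts <- table(col_classes)

class_counts
```

On the data frame used here, the same two lines would be expected to report a single class, character, for all 47 columns.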
head(data) # this should not be the case as we can see some dates and numbers
## INCIDENT_DATE INCIDENT_TIME UOF_NUMBER OFFICER_ID OFFICER_GENDER
## 2 9/3/16 4:14:00 AM 37702 10810 Male
## 3 3/22/16 11:00:00 PM 33413 7706 Male
## 4 5/22/16 1:29:00 PM 34567 11014 Male
## 5 1/10/16 8:55:00 PM 31460 6692 Male
## 6 11/8/16 2:30:00 AM 37879, 37898 9844 Male
## 7 9/11/16 7:20:00 PM 36724 9855 Male
## OFFICER_RACE OFFICER_HIRE_DATE OFFICER_YEARS_ON_FORCE OFFICER_INJURY
## 2 Black 5/7/14 2 No
## 3 White 1/8/99 17 Yes
## 4 Black 5/20/15 1 No
## 5 Black 7/29/91 24 No
## 6 White 10/4/09 7 No
## 7 White 6/10/09 7 No
## OFFICER_INJURY_TYPE OFFICER_HOSPITALIZATION SUBJECT_ID SUBJECT_RACE
## 2 No injuries noted or visible No 46424 Black
## 3 Sprain/Strain Yes 44324 Hispanic
## 4 No injuries noted or visible No 45126 Hispanic
## 5 No injuries noted or visible No 43150 Hispanic
## 6 No injuries noted or visible No 47307 Black
## 7 No injuries noted or visible No 46549 White
## SUBJECT_GENDER SUBJECT_INJURY SUBJECT_INJURY_TYPE
## 2 Female Yes Non-Visible Injury/Pain
## 3 Male No No injuries noted or visible
## 4 Male No No injuries noted or visible
## 5 Male Yes Laceration/Cut
## 6 Male No No injuries noted or visible
## 7 Female No No injuries noted or visible
## SUBJECT_WAS_ARRESTED SUBJECT_DESCRIPTION SUBJECT_OFFENSE
## 2 Yes Mentally unstable APOWW
## 3 Yes Mentally unstable APOWW
## 4 Yes Unknown APOWW
## 5 Yes FD-Unknown if Armed Evading Arrest
## 6 Yes Unknown Other Misdemeanor Arrest
## 7 Yes Unknown Assault/FV
## REPORTING_AREA BEAT SECTOR DIVISION LOCATION_DISTRICT STREET_NUMBER
## 2 2062 134 130 CENTRAL D14 211
## 3 1197 237 230 NORTHEAST D9 7647
## 4 4153 432 430 SOUTHWEST D6 716
## 5 4523 641 640 NORTH CENTRAL D11 5600
## 6 2167 346 340 SOUTHEAST D7 4600
## 7 1134 235 230 NORTHEAST D9 1234
## STREET_NAME STREET_DIRECTION STREET_TYPE
## 2 Ervay N St.
## 3 Ferguson NULL Rd.
## 4 bimebella dr NULL Ln.
## 5 LBJ NULL Frwy.
## 6 Malcolm X S Blvd.
## 7 Peavy NULL Rd.
## LOCATION_FULL_STREET_ADDRESS_OR_INTERSECTION LOCATION_CITY LOCATION_STATE
## 2 211 N ERVAY ST Dallas TX
## 3 7647 FERGUSON RD Dallas TX
## 4 716 BIMEBELLA LN Dallas TX
## 5 5600 L B J FWY Dallas TX
## 6 4600 S MALCOLM X BLVD Dallas TX
## 7 1234 PEAVY RD Dallas TX
## LOCATION_LATITUDE LOCATION_LONGITUDE INCIDENT_REASON REASON_FOR_FORCE
## 2 32.782205 -96.797461 Arrest Arrest
## 3 32.798978 -96.717493 Arrest Arrest
## 4 32.73971 -96.92519 Arrest Arrest
## 5 Arrest Arrest
## 6 Arrest Arrest
## 7 32.837527 -96.695566 Arrest Arrest
## TYPE_OF_FORCE_USED1 TYPE_OF_FORCE_USED2 TYPE_OF_FORCE_USED3
## 2 Hand/Arm/Elbow Strike
## 3 Joint Locks
## 4 Take Down - Group
## 5 K-9 Deployment
## 6 Verbal Command Take Down - Arm
## 7 Hand Controlled Escort
## TYPE_OF_FORCE_USED4 TYPE_OF_FORCE_USED5 TYPE_OF_FORCE_USED6
## 2
## 3
## 4
## 5
## 6
## 7
## TYPE_OF_FORCE_USED7 TYPE_OF_FORCE_USED8 TYPE_OF_FORCE_USED9
## 2
## 3
## 4
## 5
## 6
## 7
## TYPE_OF_FORCE_USED10 NUMBER_EC_CYCLES FORCE_EFFECTIVE
## 2 NULL Yes
## 3 NULL Yes
## 4 NULL Yes
## 5 NULL Yes
## 6 NULL No, Yes
## 7 NULL Yes
Further transformations need to be done so that appropriate analysis techniques can be applied, since numerical, categorical, and time series data cannot be analysed in the same way. Listing the variable names gives an understanding of what each variable means and how, if at all, it should be transformed.
names(data)
## [1] "INCIDENT_DATE"
## [2] "INCIDENT_TIME"
## [3] "UOF_NUMBER"
## [4] "OFFICER_ID"
## [5] "OFFICER_GENDER"
## [6] "OFFICER_RACE"
## [7] "OFFICER_HIRE_DATE"
## [8] "OFFICER_YEARS_ON_FORCE"
## [9] "OFFICER_INJURY"
## [10] "OFFICER_INJURY_TYPE"
## [11] "OFFICER_HOSPITALIZATION"
## [12] "SUBJECT_ID"
## [13] "SUBJECT_RACE"
## [14] "SUBJECT_GENDER"
## [15] "SUBJECT_INJURY"
## [16] "SUBJECT_INJURY_TYPE"
## [17] "SUBJECT_WAS_ARRESTED"
## [18] "SUBJECT_DESCRIPTION"
## [19] "SUBJECT_OFFENSE"
## [20] "REPORTING_AREA"
## [21] "BEAT"
## [22] "SECTOR"
## [23] "DIVISION"
## [24] "LOCATION_DISTRICT"
## [25] "STREET_NUMBER"
## [26] "STREET_NAME"
## [27] "STREET_DIRECTION"
## [28] "STREET_TYPE"
## [29] "LOCATION_FULL_STREET_ADDRESS_OR_INTERSECTION"
## [30] "LOCATION_CITY"
## [31] "LOCATION_STATE"
## [32] "LOCATION_LATITUDE"
## [33] "LOCATION_LONGITUDE"
## [34] "INCIDENT_REASON"
## [35] "REASON_FOR_FORCE"
## [36] "TYPE_OF_FORCE_USED1"
## [37] "TYPE_OF_FORCE_USED2"
## [38] "TYPE_OF_FORCE_USED3"
## [39] "TYPE_OF_FORCE_USED4"
## [40] "TYPE_OF_FORCE_USED5"
## [41] "TYPE_OF_FORCE_USED6"
## [42] "TYPE_OF_FORCE_USED7"
## [43] "TYPE_OF_FORCE_USED8"
## [44] "TYPE_OF_FORCE_USED9"
## [45] "TYPE_OF_FORCE_USED10"
## [46] "NUMBER_EC_CYCLES"
## [47] "FORCE_EFFECTIVE"
Two variables record the moment an incident occurred: one holds the date and the other the time. They are combined into a single variable, which makes it easier to look for patterns over time and to create clearer visuals. The INCIDENT_TIME variable can be discarded after this step. Note that the combined variable is still classified as character at this point.
# Paste date and time together into one character column
data$INCIDENT_DATE_AND_TIME <- paste(data$INCIDENT_DATE, data$INCIDENT_TIME)
head(data$INCIDENT_DATE_AND_TIME)
## [1] "9/3/16 4:14:00 AM" "3/22/16 11:00:00 PM" "5/22/16 1:29:00 PM"
## [4] "1/10/16 8:55:00 PM" "11/8/16 2:30:00 AM" "9/11/16 7:20:00 PM"
class(data$INCIDENT_DATE_AND_TIME ) # gives type of a variable
## [1] "character"
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data <- select(data, -INCIDENT_TIME)
dim(data) # still 47 variables: one column was added and one removed
## [1] 2383 47
The conversion of variables to their appropriate types starts with those that represent time and date. Date-times are stored in the “POSIXlt” class and dates in the “Date” class.
The INCIDENT_DATE_AND_TIME variable is converted to the “POSIXlt” “POSIXt” class, whereas the OFFICER_HIRE_DATE and INCIDENT_DATE variables are converted to the “Date” class.
data$INCIDENT_DATE_AND_TIME <- strptime(data$INCIDENT_DATE_AND_TIME, format = "%m/%d/%y %I:%M:%S %p")
class(data$INCIDENT_DATE_AND_TIME)#ensure that it's in the right form
## [1] "POSIXlt" "POSIXt"
# Caution: "%Y" expects a four-digit year, so a two-digit year such as "14" is stored as year 0014
# (visible in the str() output further down); "%y" would yield 2014.
data$OFFICER_HIRE_DATE <- as.Date(data$OFFICER_HIRE_DATE, format = "%m/%d/%Y")
class(data$OFFICER_HIRE_DATE)#ensure that it's in the right form
## [1] "Date"
data$INCIDENT_DATE <- as.Date(data$INCIDENT_DATE, format = "%m/%d/%Y")
class(data$INCIDENT_DATE)#ensure that it's in the right form
## [1] "Date"
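Dates written with two-digit years need the `%y` conversion rather than `%Y`. A minimal sketch of the difference, using two hire dates taken from the output above (the vector name `hire_raw` is introduced only for illustration):

```r
hire_raw <- c("5/7/14", "1/8/99")  # two hire dates as they appear in the raw data

# "%Y" reads the two digits literally as the full year
as_four_digit <- as.Date(hire_raw, format = "%m/%d/%Y")

# "%y" interprets two-digit years: 00-68 become 20xx, 69-99 become 19xx
as_two_digit <- as.Date(hire_raw, format = "%m/%d/%y")  # 2014-05-07, 1999-01-08

as_four_digit
as_two_digit
```

Because `%y` maps 69 to 99 onto 1969 to 1999, any hire date before 1969 would need separate handling.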
The following variables are converted to numeric form:
UOF_NUMBER
OFFICER_ID
OFFICER_YEARS_ON_FORCE
SUBJECT_ID
REPORTING_AREA
BEAT
SECTOR
STREET_NUMBER
LOCATION_LATITUDE
LOCATION_LONGITUDE
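# Entries listing several numbers (e.g. "37879, 37898") cannot be parsed and become NA, hence the warning
data$UOF_NUMBER<- as.numeric(as.character(data$UOF_NUMBER), na.rm = TRUE)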
## Warning: NAs introduced by coercion
class(data$UOF_NUMBER) # check if the variable is numeric
## [1] "numeric"
data$OFFICER_ID<- as.numeric(as.character(data$OFFICER_ID), na.rm = TRUE)
class(data$OFFICER_ID) # check if the variable is numeric
## [1] "numeric"
data$OFFICER_YEARS_ON_FORCE <- as.numeric(as.character(data$OFFICER_YEARS_ON_FORCE), na.rm = TRUE)
class(data$OFFICER_YEARS_ON_FORCE) # check if the variable is numeric
## [1] "numeric"
data$SUBJECT_ID <- as.numeric(as.character(data$SUBJECT_ID), na.rm = TRUE)
class (data$SUBJECT_ID) # check if the variable is numeric
## [1] "numeric"
data$REPORTING_AREA <- as.numeric(as.character(data$REPORTING_AREA), na.rm = TRUE)
class(data$REPORTING_AREA) # check if the variable is numeric
## [1] "numeric"
data$BEAT <- as.numeric(as.character(data$BEAT), na.rm = TRUE)
class(data$BEAT) # check if the variable is numeric
## [1] "numeric"
data$SECTOR <- as.numeric(as.character(data$SECTOR), na.rm = TRUE)
class(data$SECTOR) # check if the variable is numeric
## [1] "numeric"
data$STREET_NUMBER <- as.numeric(as.character(data$STREET_NUMBER), na.rm = TRUE)
class(data$STREET_NUMBER) # check if the variable is numeric
## [1] "numeric"
data$LOCATION_LATITUDE <- as.numeric(as.character(data$LOCATION_LATITUDE), na.rm = TRUE)
class(data$LOCATION_LATITUDE) # check if the variable is numeric
## [1] "numeric"
data$LOCATION_LONGITUDE <- as.numeric(as.character(data$LOCATION_LONGITUDE), na.rm = TRUE)
class(data$LOCATION_LONGITUDE)# check if the variable is numeric
## [1] "numeric"
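The ten conversions above follow the same pattern, so they could also be written as one step that reports how many entries each column loses to coercion. A minimal sketch on a made-up two-column data frame (`toy`, `numeric_cols` and `na_introduced` are hypothetical names; the values are taken from the output shown earlier):

```r
# Hypothetical toy data frame with character columns, as read from the CSV
toy <- data.frame(
  UOF_NUMBER = c("37702", "37879, 37898", "33413"),
  BEAT       = c("134", "346", "237"),
  stringsAsFactors = FALSE
)
numeric_cols <- c("UOF_NUMBER", "BEAT")

# Count entries that are present but cannot be parsed as a single number
na_introduced <- sapply(toy[numeric_cols], function(x) {
  sum(is.na(suppressWarnings(as.numeric(x))) & !is.na(x))
})

# Convert all listed columns in one step
toy[numeric_cols] <- lapply(toy[numeric_cols],
                            function(x) suppressWarnings(as.numeric(x)))

na_introduced
```

Counting the losses before converting makes it visible when a column, such as UOF_NUMBER here, holds values that a plain numeric conversion discards.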
The variable NUMBER_EC_CYCLES probably refers to the number of electronic control cycles used in the incident. Electronic control devices include Tasers, which use electrical current to immobilize a subject. The number of cycles may provide insight into the level of force used and could be relevant to an investigation of the incident. The variable contains "NULL" entries alongside numbers, and some entries hold more than one number (for example " 2, 4"). It was therefore excluded from the numeric conversion, because information would have been lost, and it remains a character variable.
unique(data$NUMBER_EC_CYCLES)
## [1] "NULL" "1" "3" "2" "4" " 2, 4" "5" "0" " 1, 1"
## [10] " 3, 2" " 3, 3" "6"
class(data$NUMBER_EC_CYCLES)
## [1] "character"
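If a numeric summary of this variable were needed later, the multi-valued entries could be split rather than discarded. A minimal sketch, assuming that comma-separated values are separate counts that may be added together (this interpretation is an assumption, and `ec_raw`, `ec_list` and `ec_total` are hypothetical names):

```r
# A few of the values listed by unique() above
ec_raw <- c("NULL", "1", " 2, 4", " 3, 3", "6")

# Split on commas, trim whitespace, and convert; "NULL" becomes NA
ec_list <- lapply(strsplit(ec_raw, ","), function(parts) {
  suppressWarnings(as.numeric(trimws(parts)))
})

# Total cycles per entry (NA where nothing was recorded)
ec_total <- sapply(ec_list, sum)
ec_total
```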
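The str() output below confirms the new variable types. Note that the two Date variables show years such as 0016 and 0014: the format "%m/%d/%Y" expects a four-digit year, so the two-digit years in the source were read literally. INCIDENT_DATE_AND_TIME, which was parsed with "%y", shows the correct year 2016.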
str(data)
## 'data.frame': 2383 obs. of 47 variables:
## $ INCIDENT_DATE : Date, format: "0016-09-03" "0016-03-22" ...
## $ UOF_NUMBER : num 37702 33413 34567 31460 NA ...
## $ OFFICER_ID : num 10810 7706 11014 6692 9844 ...
## $ OFFICER_GENDER : chr "Male" "Male" "Male" "Male" ...
## $ OFFICER_RACE : chr "Black" "White" "Black" "Black" ...
## $ OFFICER_HIRE_DATE : Date, format: "0014-05-07" "0099-01-08" ...
## $ OFFICER_YEARS_ON_FORCE : num 2 17 1 24 7 7 7 9 4 8 ...
## $ OFFICER_INJURY : chr "No" "Yes" "No" "No" ...
## $ OFFICER_INJURY_TYPE : chr "No injuries noted or visible" "Sprain/Strain" "No injuries noted or visible" "No injuries noted or visible" ...
## $ OFFICER_HOSPITALIZATION : chr "No" "Yes" "No" "No" ...
## $ SUBJECT_ID : num 46424 44324 45126 43150 47307 ...
## $ SUBJECT_RACE : chr "Black" "Hispanic" "Hispanic" "Hispanic" ...
## $ SUBJECT_GENDER : chr "Female" "Male" "Male" "Male" ...
## $ SUBJECT_INJURY : chr "Yes" "No" "No" "Yes" ...
## $ SUBJECT_INJURY_TYPE : chr "Non-Visible Injury/Pain" "No injuries noted or visible" "No injuries noted or visible" "Laceration/Cut" ...
## $ SUBJECT_WAS_ARRESTED : chr "Yes" "Yes" "Yes" "Yes" ...
## $ SUBJECT_DESCRIPTION : chr "Mentally unstable" "Mentally unstable" "Unknown" "FD-Unknown if Armed" ...
## $ SUBJECT_OFFENSE : chr "APOWW" "APOWW" "APOWW" "Evading Arrest" ...
## $ REPORTING_AREA : num 2062 1197 4153 4523 2167 ...
## $ BEAT : num 134 237 432 641 346 235 132 515 133 614 ...
## $ SECTOR : num 130 230 430 640 340 230 130 510 130 610 ...
## $ DIVISION : chr "CENTRAL" "NORTHEAST" "SOUTHWEST" "NORTH CENTRAL" ...
## $ LOCATION_DISTRICT : chr "D14" "D9" "D6" "D11" ...
## $ STREET_NUMBER : num 211 7647 716 5600 4600 ...
## $ STREET_NAME : chr "Ervay" "Ferguson" "bimebella dr" "LBJ" ...
## $ STREET_DIRECTION : chr "N" "NULL" "NULL" "NULL" ...
## $ STREET_TYPE : chr "St." "Rd." "Ln." "Frwy." ...
## $ LOCATION_FULL_STREET_ADDRESS_OR_INTERSECTION: chr "211 N ERVAY ST" "7647 FERGUSON RD" "716 BIMEBELLA LN" "5600 L B J FWY" ...
## $ LOCATION_CITY : chr "Dallas" "Dallas" "Dallas" "Dallas" ...
## $ LOCATION_STATE : chr "TX" "TX" "TX" "TX" ...
## $ LOCATION_LATITUDE : num 32.8 32.8 32.7 NA NA ...
## $ LOCATION_LONGITUDE : num -96.8 -96.7 -96.9 NA NA ...
## $ INCIDENT_REASON : chr "Arrest" "Arrest" "Arrest" "Arrest" ...
## $ REASON_FOR_FORCE : chr "Arrest" "Arrest" "Arrest" "Arrest" ...
## $ TYPE_OF_FORCE_USED1 : chr "Hand/Arm/Elbow Strike" "Joint Locks" "Take Down - Group" "K-9 Deployment" ...
## $ TYPE_OF_FORCE_USED2 : chr "" "" "" "" ...
## $ TYPE_OF_FORCE_USED3 : chr "" "" "" "" ...
## $ TYPE_OF_FORCE_USED4 : chr "" "" "" "" ...
## $ TYPE_OF_FORCE_USED5 : chr "" "" "" "" ...
## $ TYPE_OF_FORCE_USED6 : chr "" "" "" "" ...
## $ TYPE_OF_FORCE_USED7 : chr "" "" "" "" ...
## $ TYPE_OF_FORCE_USED8 : chr "" "" "" "" ...
## $ TYPE_OF_FORCE_USED9 : chr "" "" "" "" ...
## $ TYPE_OF_FORCE_USED10 : chr "" "" "" "" ...
## $ NUMBER_EC_CYCLES : chr "NULL" "NULL" "NULL" "NULL" ...
## $ FORCE_EFFECTIVE : chr " Yes" " Yes" " Yes" " Yes" ...
## $ INCIDENT_DATE_AND_TIME : POSIXlt, format: "2016-09-03 04:14:00" "2016-03-22 23:00:00" ...
The number of missing values can affect the choice of data analysis techniques. Knowing how many values are missing helps determine which techniques are appropriate and ensures that the results are valid.
There are 1756 missing values in total, spread over only 4 variables (1636 + 55 + 55 + 10 = 1756). This is not a major barrier to a conclusive analysis. The missing values are explored further in the next section.
sum(is.na(data)) # there are 1756 missing values in the data frame
## [1] 1756
sapply(data, function(x) sum(is.na(x)))#No. of missing values in each column of the data frame
## INCIDENT_DATE
## 0
## UOF_NUMBER
## 1636
## OFFICER_ID
## 0
## OFFICER_GENDER
## 0
## OFFICER_RACE
## 0
## OFFICER_HIRE_DATE
## 0
## OFFICER_YEARS_ON_FORCE
## 0
## OFFICER_INJURY
## 0
## OFFICER_INJURY_TYPE
## 0
## OFFICER_HOSPITALIZATION
## 0
## SUBJECT_ID
## 0
## SUBJECT_RACE
## 0
## SUBJECT_GENDER
## 0
## SUBJECT_INJURY
## 0
## SUBJECT_INJURY_TYPE
## 0
## SUBJECT_WAS_ARRESTED
## 0
## SUBJECT_DESCRIPTION
## 0
## SUBJECT_OFFENSE
## 0
## REPORTING_AREA
## 0
## BEAT
## 0
## SECTOR
## 0
## DIVISION
## 0
## LOCATION_DISTRICT
## 0
## STREET_NUMBER
## 0
## STREET_NAME
## 0
## STREET_DIRECTION
## 0
## STREET_TYPE
## 0
## LOCATION_FULL_STREET_ADDRESS_OR_INTERSECTION
## 0
## LOCATION_CITY
## 0
## LOCATION_STATE
## 0
## LOCATION_LATITUDE
## 55
## LOCATION_LONGITUDE
## 55
## INCIDENT_REASON
## 0
## REASON_FOR_FORCE
## 0
## TYPE_OF_FORCE_USED1
## 0
## TYPE_OF_FORCE_USED2
## 0
## TYPE_OF_FORCE_USED3
## 0
## TYPE_OF_FORCE_USED4
## 0
## TYPE_OF_FORCE_USED5
## 0
## TYPE_OF_FORCE_USED6
## 0
## TYPE_OF_FORCE_USED7
## 0
## TYPE_OF_FORCE_USED8
## 0
## TYPE_OF_FORCE_USED9
## 0
## TYPE_OF_FORCE_USED10
## 0
## NUMBER_EC_CYCLES
## 0
## FORCE_EFFECTIVE
## 0
## INCIDENT_DATE_AND_TIME
## 10
These missing values can be represented visually as shown below
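One way such a chart could be drawn is sketched here in base R, using the per-column counts reported above. The plotting choices are assumptions and not necessarily those of the original figure; `na_counts`, `n_rows` and `na_share` are names introduced for illustration.

```r
# Counts copied from the per-column output above
na_counts <- c(UOF_NUMBER = 1636, LOCATION_LATITUDE = 55,
               LOCATION_LONGITUDE = 55, INCIDENT_DATE_AND_TIME = 10)
n_rows <- 2383

# Percentage of rows missing for each variable
na_share <- round(100 * na_counts / n_rows, 1)
na_share

# Bar chart of missing values per variable, largest first
barplot(sort(na_counts, decreasing = TRUE), las = 2, cex.names = 0.6,
        main = "Missing values per variable", ylab = "Number of NA entries")
```

With the real data frame, the counts would come from `sapply(data, function(x) sum(is.na(x)))`, filtered to the columns with at least one missing value.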
The variables in the data set describe incidents handled by the police in Dallas, Texas in 2016. There were 2383 incidents in that year. The time and place of each incident are given. The officers who handled the incidents are identified by the OFFICER_ID variable, and the subjects involved are identified by SUBJECT_ID. Further variables give characteristics of the officers and subjects, and others describe the actions and events that took place in each incident.
As mentioned above, the missing values are not a major concern, given that they make up only a small part of the entire data set (1756 of 2383 × 47 = 112,001 cells, under 2%).
The variables with missing values, and the number of missing entries in each, are as follows:
INCIDENT_DATE_AND_TIME has 10 missing values
UOF_NUMBER has 1636 missing values
LOCATION_LATITUDE has 55 missing values
LOCATION_LONGITUDE has 55 missing values
The missing values in UOF_NUMBER (about 69% of the 2383 rows) arise from the numeric conversion: entries that list several numbers, such as "37879, 37898", cannot be parsed as a single number. The other three variables are each missing in less than 3% of rows. LOCATION_LATITUDE and LOCATION_LONGITUDE have the same number of missing values and a similar pattern of missing entries. Because the data set has other variables that represent location, such as STREET_NUMBER and STREET_NAME, this is not a cause for concern.
mytable <- table(data$SUBJECT_RACE, data$SUBJECT_WAS_ARRESTED)
mytable
##
## No Yes
## American Ind 0 1
## Asian 0 5
## Black 189 1144
## Hispanic 73 451
## NULL 13 26
## Other 3 8
## White 57 413
The table shows that most of the subjects arrested are Black (1144 of 2048 arrests). Every recorded incident involving ‘American Ind’ and ‘Asian’ subjects ended in an arrest, since neither group has any "No" entries. With only 1 and 5 incidents respectively, however, these groups are too small to support conclusions.
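Raw counts are easier to compare once they are expressed as proportions. A minimal sketch that rebuilds the table above as a matrix and computes both the arrest rate within each group and each group's share of all arrests (`arrest_tab`, `arrest_rate` and `arrest_share` are names introduced for illustration):

```r
# Counts copied from the table above
arrest_tab <- matrix(
  c(0, 1,  0, 5,  189, 1144,  73, 451,  13, 26,  3, 8,  57, 413),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("American Ind", "Asian", "Black", "Hispanic",
                    "NULL", "Other", "White"),
                  c("No", "Yes"))
)

# Percentage of incidents ending in arrest, within each group (row proportions)
arrest_rate <- round(100 * prop.table(arrest_tab, margin = 1)[, "Yes"], 1)

# Each group's percentage of all arrests (column proportions)
arrest_share <- round(100 * prop.table(arrest_tab, margin = 2)[, "Yes"], 1)

arrest_rate
arrest_share
```

With the real data, `prop.table(mytable, margin = 1)` gives the same row proportions directly. Row proportions separate how often a group is arrested once an incident occurs from how often the group appears in the data at all.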
library(ggplot2)
num_arrests_by_gender <- data %>%
filter(SUBJECT_WAS_ARRESTED == "Yes") %>%
group_by(SUBJECT_GENDER) %>%
summarize(num_arrests = n())
num_arrests_by_gender
## # A tibble: 4 × 2
## SUBJECT_GENDER num_arrests
## <chr> <int>
## 1 Female 378
## 2 Male 1661
## 3 NULL 8
## 4 Unknown 1
library(ggplot2)
library(scales)
# Calculate the total number of arrests
total_arrests <- sum(num_arrests_by_gender$num_arrests)
# Create a bar plot of the number of arrests by gender
ggplot(num_arrests_by_gender, aes(x = SUBJECT_GENDER, y = num_arrests/total_arrests, fill = SUBJECT_GENDER)) +
geom_bar(stat = "identity") +
labs(title = "Share of Arrests by Gender", x = "Gender", y = "Percentage of Arrests") +
# Set the y-axis limits to 0-1 and format the labels as percentages that sum up to 100%
scale_y_continuous(limits = c(0, 1), labels = percent_format(accuracy = 1)) +
# Add text labels at the top of the bars
geom_text(aes(label = num_arrests), position = position_stack(vjust = 0.5))
The plot shows that male subjects account for the most arrests, and the gap between the two genders is large: 1661 men were arrested compared with 378 women. Female arrests make up less than 25% of all arrests (378 of 2048, about 18%).
library(ggplot2)
# Count the injury incidents per officer (rows where the officer was injured)
num_officers_injured <- data %>%
filter(OFFICER_INJURY == "Yes") %>%
group_by(OFFICER_ID) %>%
summarize(num_injuries = n())
num_officers_injured
## # A tibble: 203 × 2
## OFFICER_ID num_injuries
## <dbl> <int>
## 1 4312 1
## 2 5170 1
## 3 5230 1
## 4 5258 2
## 5 5260 2
## 6 5889 1
## 7 5909 1
## 8 5927 1
## 9 6044 1
## 10 6231 1
## # … with 193 more rows
sum(num_officers_injured$num_injuries)
## [1] 234
# Create a histogram of the number of officers who are injured
ggplot(num_officers_injured, aes(x = num_injuries)) +
geom_histogram(binwidth = 1, color = "black", fill = "lightblue", alpha = 0.7) +
labs(title = "Injury Incidents per Injured Officer", x = "Number of Injuries", y = "Number of Officers") +
theme_minimal()
Officers were injured in a total of 234 incidents, involving 203 distinct officers. The plot shows that most of these officers were injured only once. Officer injury is therefore not very common: it was recorded in 234 of the 2383 incidents (about 10%).
# Count the subject injury incidents per officer (rows where the subject was injured)
num_subject_injured <- data %>%
filter(SUBJECT_INJURY == "Yes") %>%
group_by(OFFICER_ID) %>%
summarize(num_injuries = n())
num_subject_injured
## # A tibble: 451 × 2
## OFFICER_ID num_injuries
## <dbl> <int>
## 1 0 1
## 2 4312 1
## 3 4665 1
## 4 5066 1
## 5 5073 1
## 6 5170 1
## 7 5260 2
## 8 5395 1
## 9 5485 2
## 10 5541 1
## # … with 441 more rows
sum(num_subject_injured$num_injuries)
## [1] 629
library(ggplot2)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
p <- ggplot(num_subject_injured, aes(x = OFFICER_ID, y = num_injuries)) +
geom_point(color = "red", alpha = 0.7) +
labs(title = "Number of Subject Injury Incidents per Officer", x = "Officer ID", y = "Number of Injuries") +
theme_minimal()
ggplotly(p)
A total of 629 subject injuries were recorded across 451 officer IDs. In the scatter plot each red dot represents an officer. Most officers were involved in only one such case, although some were involved more than once. Officers are expected to use force only when necessary and to use the minimum amount of force needed to accomplish their objectives. The officer with ID number 10818 was involved in a total of 7 subject injuries; a review of those incidents may be warranted, and disciplinary action or criminal charges could follow if the officer is found to have acted improperly.
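Officers with several subject-injury records can also be listed directly rather than read off the scatter plot. The following is a minimal sketch on made-up data; the IDs in `toy` and the cutoff `min_records` are hypothetical choices for illustration, not values from the original analysis.

```r
# Hypothetical toy records: one row per use-of-force report
toy <- data.frame(
  OFFICER_ID     = c(201, 202, 202, 203, 203, 203, 204),
  SUBJECT_INJURY = c("Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No")
)

# Keep only the reports in which the subject was injured
injured <- toy[toy$SUBJECT_INJURY == "Yes", ]

# Count subject-injury records per officer, largest first
counts <- sort(table(injured$OFFICER_ID), decreasing = TRUE)

# Officers at or above an arbitrary cutoff (an assumption, not a policy rule)
min_records <- 2
repeat_officers <- names(counts)[counts >= min_records]
print(repeat_officers)
```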
# Create a subset of the data with the necessary variables
force_years <- data[, c("REASON_FOR_FORCE", "OFFICER_YEARS_ON_FORCE")]
# Create the box plot
ggplot(force_years, aes(x = OFFICER_YEARS_ON_FORCE, y = REASON_FOR_FORCE)) +
geom_boxplot() +
labs(title = "Reason for Use of Force by Officer Years on Force",
x = "Officer Years on Force",
y = "Reason for Use of Force") +
theme_minimal()
From the plot above it is seen that officers who use force most often have been on the force for 5 to 10 years, whereas officers with longer service rarely use force. Crowd disbursement is associated with more experienced officers, which could reflect their experience in such situations. Although use of force is concentrated among less experienced officers, outliers with long service are observed for weapon display, other cases, danger to self or others and arrest, with arrest having the most. The less frequent use of force by officers with more years of experience could be attributed to a better ability to handle incidents without force. For aggressive animals, officers of all experience levels appear to use force.
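The medians and outliers described above can be tabulated per reason instead of being read from the plot. The sketch below uses a made-up data frame (`toy` and its values are hypothetical) and `boxplot.stats()`, which applies the usual 1.5 × IQR rule; its hinges can differ slightly from the quartiles that `geom_boxplot()` computes, so counts on real data may not match the plot exactly.

```r
# Hypothetical toy data: years of service for two reasons for force
toy <- data.frame(
  REASON_FOR_FORCE       = c(rep("Arrest", 6), rep("Weapon Display", 5)),
  OFFICER_YEARS_ON_FORCE = c(2, 3, 4, 5, 6, 30, 1, 2, 3, 4, 5)
)

# Median years of service per reason
medians <- tapply(toy$OFFICER_YEARS_ON_FORCE, toy$REASON_FOR_FORCE, median)

# Number of values flagged as outliers by the 1.5 * IQR rule, per reason
outliers <- tapply(
  toy$OFFICER_YEARS_ON_FORCE, toy$REASON_FOR_FORCE,
  function(x) length(boxplot.stats(x)$out)
)

print(medians)
print(outliers)
```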
Correlation is a measure of linear association between two continuous variables, so only numerical variables were considered. To include some key categorical variables, we converted them to factors. OFFICER_ID and SUBJECT_ID were the only numerical variables that were not included. SECTOR and REPORTING_AREA were removed because they convey the same location information as BEAT.
Because many variables could be used to calculate correlations, we opted to focus on the top 10 variables to obtain a more informative visual representation.
library(corrplot)
## corrplot 0.92 loaded
library(ggcorrplot)
library(reshape2)
attach(data)
# Convert the vector to a factor
Officerrace<- factor(OFFICER_RACE)
subjectarrest <- factor(SUBJECT_WAS_ARRESTED)
officergender <- factor(OFFICER_GENDER)
subjectrace <- factor(SUBJECT_RACE)
subjectgender<- factor(SUBJECT_GENDER)
subjectinjury <- factor(SUBJECT_INJURY)
officerinjury <- factor(OFFICER_INJURY)
# Subset the dataset to the numerical variables and the factors created above
subdata <- data.frame(UOF_NUMBER,OFFICER_YEARS_ON_FORCE,BEAT, INCIDENT_DATE_AND_TIME, OFFICER_HIRE_DATE, Officerrace, subjectarrest, officergender, subjectrace, subjectgender, subjectinjury, officerinjury )
# convert the data frame to numeric
subdata <- apply(subdata, 2, as.numeric)
## Warning in apply(subdata, 2, as.numeric): NAs introduced by coercion
## Warning in apply(subdata, 2, as.numeric): NAs introduced by coercion
## Warning in apply(subdata, 2, as.numeric): NAs introduced by coercion
## Warning in apply(subdata, 2, as.numeric): NAs introduced by coercion
## Warning in apply(subdata, 2, as.numeric): NAs introduced by coercion
## Warning in apply(subdata, 2, as.numeric): NAs introduced by coercion
## Warning in apply(subdata, 2, as.numeric): NAs introduced by coercion
## Warning in apply(subdata, 2, as.numeric): NAs introduced by coercion
## Warning in apply(subdata, 2, as.numeric): NAs introduced by coercion
cor_matrix <- cor(subdata)
# Creating a matrix of the same dimensions as cor_matrix, but with all entries set to 0
mask <- matrix(0, nrow(cor_matrix), ncol(cor_matrix))
# Setting entries in the mask matrix to 1 for correlations whose absolute value exceeds the threshold (0 here, so every non-zero coefficient is labelled)
mask[abs(cor_matrix) > 0] <- 1
# Creating a heat map of the correlation matrix, with high correlations highlighted in red
library(ggplot2)
ggplot2::ggplot(data = reshape2::melt(cor_matrix)) +
ggplot2::geom_tile(ggplot2::aes(x = Var1, y = Var2, fill = value)) +
ggplot2::geom_text(data = reshape2::melt(cor_matrix * mask),
ggplot2::aes(x = Var1, y = Var2, label = round(value, 2)),
color = "red") +
ggplot2::scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Correlation\nCoefficient") +
ggplot2::theme_minimal() +
ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 90, vjust = 1,
size = 10, hjust = 1)) +
ggplot2::coord_fixed()
## Warning: Removed 130 rows containing missing values (`geom_text()`).
It can be seen that most variables are not highly correlated; however, BEAT and OFFICER_YEARS_ON_FORCE show a mid-level correlation. Because BEAT is an area identifier rather than a measured quantity, this coefficient should be read with caution. A possible explanation is that officers with more years on the force may have developed closer relationships with members of the community they serve, and may therefore have a greater understanding of the needs and concerns of residents in specific areas or beats. This could lead to more effective policing and a lower crime rate, which could in turn lead to a correlation between officer experience and beat.
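The repeated "NAs introduced by coercion" warnings above are consistent with how `apply()` treats a data frame that contains factors: it first converts the whole frame to a character matrix, so labels such as "Male" cannot be parsed as numbers. The sketch below contrasts this with `data.matrix()`, which keeps the integer level codes. The data frame `toy` is made up for illustration. Level codes of unordered factors (race, for example) have an arbitrary order, so the resulting coefficients show the mechanics only and are hard to interpret.

```r
# Hypothetical toy data with one numeric column and two factors
toy <- data.frame(
  OFFICER_YEARS_ON_FORCE = c(1, 3, 5, 7, 9, 11),
  OFFICER_GENDER = factor(c("Male", "Female", "Male", "Male", "Female", "Male")),
  SUBJECT_INJURY = factor(c("No", "No", "Yes", "No", "Yes", "Yes"))
)

# apply() works on a character matrix, so factor labels become NA
via_apply <- suppressWarnings(apply(toy, 2, as.numeric))

# data.matrix() converts each factor to its integer level codes instead
via_codes <- data.matrix(toy)

# Pairwise-complete correlation guards against any remaining NAs
cor_codes <- cor(via_codes, use = "pairwise.complete.obs")
print(round(cor_codes, 2))
```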
library(plotly)
plot_ly(data , x = ~OFFICER_YEARS_ON_FORCE, y = ~INCIDENT_DATE, color = ~OFFICER_RACE, size = ~UOF_NUMBER,
mode = "markers", type = "scatter") %>%
layout(title = "Incident Date vs Officer Years on Force",
xaxis = list(title = "Officer Years on Force"),
yaxis = list(title = "Incident Date"))
## Warning: `line.width` does not currently support multiple values.
## Warning: `line.width` does not currently support multiple values.
## Warning: `line.width` does not currently support multiple values.
## Warning: `line.width` does not currently support multiple values.
## Warning: `line.width` does not currently support multiple values.
## Warning: `line.width` does not currently support multiple values.
No single variable records the number of incidents reported. However, INCIDENT_DATE gives the date on which each incident occurred, so grouping the records by day yields the number of incidents that occurred on a given day.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ lubridate 1.9.2 ✔ tibble 3.1.8
## ✔ purrr 1.0.1 ✔ tidyr 1.3.0
## ✔ readr 2.1.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ purrr::discard() masks scales::discard()
## ✖ plotly::filter() masks dplyr::filter(), stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)
# Group data by day and calculate total incidents
daily_incidents <- data %>%
group_by(INCIDENT_DATE) %>%
summarise(total_incidents = n())
# Create time series plot with plotly
plot_ly(daily_incidents, x = ~INCIDENT_DATE, y = ~total_incidents, type = "scatter", mode = "lines") %>%
layout(title = "Total Incidents per Day",
xaxis = list(title = "Date"),
yaxis = list(title = "Total Incidents"),
colorway = c("blue", "red", "green", "orange", "purple")) %>%
highlight(on = "plotly_hover", dynamic = TRUE)
## Adding more colors to the selection color palette.
From the time series plot above it can be seen that the highest number of incidents occurred on the 30th of September, followed by the 14th of February. Valentine’s Day is traditionally associated with romance and love, but for some people it can be a source of stress, anxiety, or disappointment. This could lead to an increase in domestic disputes or other incidents that require police intervention. Special events and activities also often take place on this day and can require a large police presence. These reasons could explain the February spike in reported incidents.
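The busiest days can be extracted from the daily counts directly instead of being read from the interactive plot. This is a minimal sketch on made-up dates; the contents of `toy` are hypothetical, and it assumes that INCIDENT_DATE can be treated as a date or a date-like string.

```r
# Hypothetical toy data: one row per incident
toy <- data.frame(
  INCIDENT_DATE = as.Date(c(
    "2016-02-14", "2016-02-14", "2016-02-14",
    "2016-09-30", "2016-09-30", "2016-09-30", "2016-09-30",
    "2016-03-01"
  ))
)

# Count incidents per day
daily <- as.data.frame(table(toy$INCIDENT_DATE), stringsAsFactors = FALSE)
names(daily) <- c("INCIDENT_DATE", "total_incidents")

# Sort so that the busiest days come first and keep the top two
daily <- daily[order(-daily$total_incidents), ]
top_days <- head(daily, 2)
print(top_days)
```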
# Create a scatter plot with a loess smoothing line
ggplot(data, aes(x = OFFICER_YEARS_ON_FORCE, y = INCIDENT_DATE)) +
geom_point() +
geom_smooth(method = "loess", se = FALSE) +
labs(x = "Officer Years on Force", y = "Incident Date",
title = "Relationship between Officer Experience and Incidents")
## `geom_smooth()` using formula = 'y ~ x'
From the plot above it can be seen that officers with less experience report the most incidents. One possible explanation is that newer officers tend to report more of the cases they observe because they have less experience in discerning what counts as a case. They may also use force more often, as the box plot above suggests, leading to a higher number of incidents.
The smoothing line sits in the middle of the graph, which shows that there is no clear linear relationship between officer experience and incidents reported. This could indicate that the variables are not strongly correlated. Saying that police officers with less experience report more cases is therefore a major generalization that ought to be avoided. Other variables could account for the higher number of incidents among officers with less experience.
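One way to examine the relationship numerically is to count incidents for each value of years on force and compute a rank correlation on those counts. The sketch below does this on made-up data; `toy` and the column names `years_on_force` and `n_incidents` are hypothetical. Raw counts do not account for how many officers there are at each experience level, so a per-officer rate would be needed before drawing conclusions.

```r
# Hypothetical toy data: one row per incident
toy <- data.frame(OFFICER_YEARS_ON_FORCE = c(1, 1, 1, 1, 2, 2, 2, 5, 5, 10))

# Number of incidents for each value of years on force
per_year <- as.data.frame(table(toy$OFFICER_YEARS_ON_FORCE),
                          stringsAsFactors = FALSE)
names(per_year) <- c("years_on_force", "n_incidents")
per_year$years_on_force <- as.numeric(per_year$years_on_force)

# Spearman rank correlation between experience and incident count
rho <- cor(per_year$years_on_force, per_year$n_incidents, method = "spearman")
print(rho)
```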
The map below shows the areas in Dallas, Texas from which police reports were received. By visualizing cases on a map, we can see where they are concentrated and whether they are clustered in particular regions or areas. This can help us identify crime or incident hotspots that may require further investigation. Mapping can also help us understand the relationship between cases and other features that define an area, such as the presence of schools, entertainment spots and population density. Identifying areas with many incidents could help the police force make informed decisions, such as where to deploy more officers.
# Load the required packages
library(leaflet)
# Create a leaflet map
leaflet(data ) %>%
addTiles() %>%
addMarkers(~LOCATION_LONGITUDE, ~LOCATION_LATITUDE, popup = ~REPORTING_AREA)
## Warning in validateCoords(lng, lat, funcName): Data contains 55 rows with either
## missing or invalid lat/lon values and will be ignored
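The warning above reports rows whose coordinates are missing or invalid. These rows can be counted and dropped explicitly before mapping, which makes the exclusion visible in the analysis. The sketch below uses a made-up data frame; the coordinates and reporting areas in `toy` are hypothetical.

```r
# Hypothetical toy data with some missing coordinates
toy <- data.frame(
  LOCATION_LATITUDE  = c(32.78, NA, 32.80, 32.75),
  LOCATION_LONGITUDE = c(-96.80, -96.77, NA, -96.82),
  REPORTING_AREA     = c(2062, 1197, 4153, 4523)
)

# TRUE for rows where both latitude and longitude are present
has_coords <- complete.cases(toy[, c("LOCATION_LATITUDE", "LOCATION_LONGITUDE")])

# Number of rows that cannot be placed on the map
n_dropped <- sum(!has_coords)

# Rows that can be passed on to the mapping code
mappable <- toy[has_coords, ]
print(n_dropped)
```

A data frame such as `mappable` could then be passed to `leaflet()` in place of the full data.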
Creating a map that shows Incident Reason, Subject Race, and Subject Offense allows us to visually analyze and understand the relationships between these variables in a geographic context. Mapping incident reasons can provide insights into the types of incidents that occur in different areas. An incident reason such as burglary could indicate that houses in that area need better security systems. Including subject race in the map can help us understand whether certain incident reasons are more prevalent in areas with a particular demographic makeup. This could indicate the existence of systemic issues that need to be addressed. The variable Subject Offense gives more context to the incident being addressed.
library(leaflet)
leaflet(data) %>%
addTiles() %>%
addMarkers(lng = ~LOCATION_LONGITUDE, lat = ~LOCATION_LATITUDE,
popup = ~paste("Incident Reason: ", INCIDENT_REASON,
"<br>Subject Race: ", SUBJECT_RACE,
"<br>Subject Offense: ", SUBJECT_OFFENSE
))
## Warning in validateCoords(lng, lat, funcName): Data contains 55 rows with either
## missing or invalid lat/lon values and will be ignored
From the map above we can see that the following cases were reported within the same locale:
| Variable | Case 1 | Case 2 |
|---|---|---|
| Incident Reason | Arrest | Arrest |
| Subject Race | Black | Black |
| Subject Offense | Drug Possession - Misdemeanor, Evading Arrest, Warrant/Hold | Assault/FV |
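Both cases in this locale involve Black subjects and ended in arrest, one for drug possession, evading arrest and a warrant/hold and the other for assault/FV. Two cases are too few to show what most incidents or subjects in the area look like, so this observation would need to be checked against counts for the whole area.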
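Statements about an area are easier to support with counts than with individual map popups. The sketch below tabulates subject race within one reporting area on made-up data; the values in `toy` and the area code 2062 are hypothetical and chosen only for illustration.

```r
# Hypothetical toy data: one row per use-of-force report
toy <- data.frame(
  REPORTING_AREA       = c(2062, 2062, 2062, 1197, 1197),
  SUBJECT_RACE         = c("Black", "Black", "White", "Hispanic", "Black"),
  SUBJECT_WAS_ARRESTED = c("Yes", "Yes", "No", "Yes", "Yes")
)

# Restrict to a single reporting area (arbitrary choice for the example)
area <- toy[toy$REPORTING_AREA == 2062, ]

# Counts and percentage shares of subject race within that area
race_counts <- table(area$SUBJECT_RACE)
race_share  <- round(100 * prop.table(race_counts), 1)

print(race_counts)
print(race_share)
```

Reporting the number of records alongside the shares shows how much data a conclusion rests on.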